[SPARK-9793] [MLlib] [PySpark] PySpark DenseVector, SparseVector implement __eq__ and __hash__ correctly #8166
yanboliang wants to merge 7 commits into
Test build #40766 has finished for PR 8166 at commit
Jenkins, test this please.
Test build #40949 has finished for PR 8166 at commit
nit: since k1 can be at most v1_size after the earlier while loop, checking for == here will suffice and is easier to read
Actually, I think checking k1 >= v1_size is more robust than k1 == v1_size, and the Scala code also uses the former.
OK, that's fine with me
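For readers following this thread, the loop being discussed can be sketched roughly as below. This is a minimal illustration that reuses the names k1 and v1_size from the comments above. It is not the code from the PR: the function name `sparse_entries_equal` and its parameters are assumptions, and the vector dimensions are assumed to be compared separately.

```python
def sparse_entries_equal(v1_indices, v1_values, v2_indices, v2_values):
    """Compare two sparse vectors, ignoring explicitly stored zeros.

    Hypothetical helper for illustration only; names are assumptions.
    """
    v1_size = len(v1_values)
    v2_size = len(v2_values)
    k1 = 0
    k2 = 0
    while True:
        # Skip entries that are stored explicitly but hold a zero.
        while k1 < v1_size and v1_values[k1] == 0:
            k1 += 1
        while k2 < v2_size and v2_values[k2] == 0:
            k2 += 1
        # Using >= rather than == stays correct even if an index overshoots.
        if k1 >= v1_size or k2 >= v2_size:
            return k1 >= v1_size and k2 >= v2_size
        if v1_indices[k1] != v2_indices[k2] or v1_values[k1] != v2_values[k2]:
            return False
        k1 += 1
        k2 += 1
```

With the skip loops written as above, k1 never exceeds v1_size, so both checks behave the same here; the >= form simply does not rely on that invariant.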
LGTM after docstring change
Test build #41666 has finished for PR 8166 at commit
@yanboliang Please update the PR to use the first 128 nonzero entries to compute the hash.
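One way to read this request is sketched below. It is only an illustration under stated assumptions, not the implementation from the PR: the names `vector_hash`, `MAX_HASH_NNZ`, and `_double_bits`, the multiplier 31, and the 32-bit masking are all choices made for this example.

```python
import struct

MAX_HASH_NNZ = 128  # cap taken from the request above


def _double_bits(value):
    # Reinterpret the 8 bytes of a double as an unsigned 64-bit integer.
    return struct.unpack('Q', struct.pack('d', value))[0]


def vector_hash(size, indices, values):
    """Hash a vector from its size and its first MAX_HASH_NNZ nonzero entries.

    Hypothetical helper; a dense vector can pass range(size) as indices.
    """
    result = 31 + size
    nnz = 0
    for i, v in zip(indices, values):
        if nnz >= MAX_HASH_NNZ:
            break
        if v != 0:  # zeros are skipped so dense and sparse forms agree
            bits = _double_bits(v)
            result = (31 * result + i) & 0xFFFFFFFF
            result = (31 * result + (bits ^ (bits >> 32))) & 0xFFFFFFFF
            nnz += 1
    return result
```

Because only nonzero entries contribute, a dense vector and a sparse vector holding the same values hash identically, and the cap bounds the cost for very long vectors at the price of more collisions among vectors that differ only in later entries.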
d63d54e to 3b8ac7a
Test build #42420 has finished for PR 8166 at commit
We can make the code more readable:

    if isnan(value):
        value = float('nan')
    return struct.unpack('Q', struct.pack('d', value))[0]
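As a self-contained sketch, the suggestion can be wrapped in a function with the imports it needs. The function name `double_to_long_bits` is an assumption chosen for this example and is not confirmed by the thread.

```python
import struct
from math import isnan


def double_to_long_bits(value):
    """Return the 64-bit pattern of a double as an unsigned integer.

    Hypothetical name. Every NaN is first replaced by float('nan') so that
    NaNs with different sign or payload bits map to the same result.
    """
    if isnan(value):
        value = float('nan')
    # 'd' packs a native double; 'Q' reads the same 8 bytes as an unsigned long long.
    return struct.unpack('Q', struct.pack('d', value))[0]


print(hex(double_to_long_bits(1.0)))  # 0x3ff0000000000000
```

Normalizing NaN first matters for hashing, because two NaN values can differ in their raw bits even though both are NaN.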
The PySpark DenseVector and SparseVector __eq__ methods should use semantic equality, so that a DenseVector can be compared with a SparseVector.
Implement the PySpark DenseVector and SparseVector __hash__ methods based on the first 16 entries. That allows PySpark Vector objects to be used in collections.